Systematic Biology
Oxford University Press (OUP)
Preprints posted in the last 90 days, ranked by how well they match Systematic Biology's content profile, based on 121 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Milkey, A.; Chen, J.; Lewis, P. O.
As modern phylogenomics datasets become increasingly large, it is useful to develop recommendations for how to subsample datasets for best species tree inference. Here we apply a new measure of phylogenetic information content that estimates the reduction in tree space occupied by a posterior sample of inferred trees relative to a prior sample in order to assess the effects of gene tree parameters on species tree estimation. We find that, consistent with earlier studies, when data are informative, more data result in better species tree inference. However, when data are uninformative, subsampling a dataset to include only the most informative loci may produce a better species tree sample. We perform analyses on a variety of simulated and empirical datasets.
Ren, H.; Jiang, C.; Wong, T. K. F.; Shao, Y.; Susko, E.; Minh, B. Q.; Lanfear, R.
Partitioned and mixture models are widely employed in Maximum Likelihood phylogenetic analyses of large genomic datasets. Comparing the fit of the two types of models has been challenging, because standard information-theoretic approaches cannot be applied. Mixture models are increasingly popular for the analysis of amino acid datasets and can lead to different conclusions compared to partitioned models. This raises an important question: which type of model tends to perform better? Susko et al. (2026) recently introduced the marginal Akaike information criterion (mAIC), which allows mixture models and partitioned models to be directly compared for the first time. Here, we use the mAIC and a range of other approaches to compare the fit of mixture and partitioned models across a diverse set of empirical datasets. We show that mixture models are universally favoured on amino acid datasets. This has important implications for interpreting empirical analyses and suggests that continued development of mixture models is an important avenue for future research.
Nagel, A. A.; Landis, M. J.
Ancestral state reconstruction is a classical problem of broad relevance in phylogenetics. Likelihood-based methods for reconstructing ancestral states under discrete character models, such as Markov models, have proven extremely useful, but only work so long as the assumed model yields a tractable likelihood function. Unfortunately, extending a simple but tractable phylogenetic model to possess new, but biologically realistic, properties often results in an intractable likelihood, preventing its use in standard modeling tasks, including ancestral state reconstruction. The rapid advancement of deep learning offers a potential alternative to likelihood-based inference of ancestral states, particularly for models with intractable likelihoods. In this study, we modify the phylogenetic deep learning software PHYDDLE to conduct ancestral state reconstruction. We evaluate PHYDDLE's performance under various methodological and modeling conditions, while comparing to Bayesian inference when possible. For simple models and small trees, its performance resembles the performance of Bayesian inference, but worsens as tree size increases. While PHYDDLE still performs adequately for more complex models, such as speciation and extinction models, the estimates differ more from Bayesian inference in comparison with simpler models. Lastly, we use PHYDDLE to infer ancestral states for two empirical datasets: one on the ancestral ranges of a subclade of the genus Liolaemus, and one on ancestral locations for sequences from the 2014 Sierra Leone Ebola virus disease outbreak.
Khakurel, B.; Hoehna, S.
The rate of evolution of a single morphological character is not homogeneous across the phylogeny and this rate heterogeneity varies between morphological characters. However, traditional models of morphological character evolution often assume that all characters evolve according to a time-homogeneous Markov process, which applies uniformly across the entire phylogeny. While models incorporating among-character rate variation alleviate the assumption of the same rate for all characters, they still fail to address lineage-specific rate variation for individual characters. The covarion model, originally developed for molecular data to model the invariability of some sites for parts of the phylogeny, provides a promising framework for addressing this issue in morphological phylogenetics. In this study, we extend the covarion model in RevBayes to morphological character evolution, which we call the covariomorph model, and apply it to a diverse range of morphological datasets. Our covariomorph model utilizes multiple rate categories derived from a discretized probability distribution, which scales rate matrices accordingly. Characters are allowed to evolve within any of these rate categories, with the possibility of switching between rate categories during the evolutionary process. We verified our implementation of the covariomorph model with the help of simulations. Additionally, we examined 164 empirical datasets, finding patterns of rate heterogeneity compatible with covarion-like dynamics in approximately half of them. Upon further examination of two focal datasets that exhibited covarion-like rate variation, we found that the covariomorph model provides a more nuanced approach to incorporate rate variation across lineages, significantly affecting the resulting tree topology and branch lengths compared to traditional models. The observed sensitivity of branch lengths to model choice underscores potential implications of this approach for divergence time estimation and evolutionary rate calculations. By accounting for lineage- and character-specific rate shifts, the covariomorph model offers a robust framework to improve the accuracy of morphological phylogenetic inference.
Milkey, A.; Lewis, P. O.
A new Bayesian measure of phylogenetic information content is introduced based on geodesic distances in treespace. The measure is based on the relative variance of phylogenetic trees sampled from the posterior distribution compared to the prior distribution. This ratio is expected to equal 1 if there is no information in the data about phylogeny and 0 if there is complete information. Trees can be scaled to have the same mean tree length to avoid dominance by edge length information and focus on topological information. The method scales well, requiring only that a valid sample can be obtained from both prior and posterior distributions. We show how dissonance (information conflict) among data sets can also be estimated. Both simulated and empirical examples are provided to illustrate that the new approach produces sensible and intuitive results.
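The posterior-to-prior variance ratio described in this abstract can be illustrated with a minimal sketch. It assumes that pairwise distances between sampled trees have already been computed (the abstract uses geodesic distances in treespace; the sketch accepts any symmetric distance matrix) and it uses half the mean squared pairwise distance as a stand-in for the variance. The function names `dispersion` and `variance_ratio` are hypothetical and are not taken from the authors' software, whose estimator may differ.

```python
def dispersion(dist):
    """Variance analogue for a sample in a metric space.

    Returns sum_{i<j} d(i, j)^2 / (n * (n - 1)), which equals the usual
    unbiased sample variance when the points lie on a line.
    `dist` is a symmetric n x n matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two sampled trees")
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += dist[i][j] ** 2
    return total / (n * (n - 1))


def variance_ratio(prior_dist, posterior_dist):
    """Posterior dispersion divided by prior dispersion.

    Values near 1 suggest the data carry little information about the
    tree; values near 0 suggest the posterior is tightly concentrated.
    """
    return dispersion(posterior_dist) / dispersion(prior_dist)


if __name__ == "__main__":
    # Toy distances (assumed, not real tree samples): the posterior
    # sample is half as spread out as the prior sample.
    prior = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
    posterior = [[0, 0.5, 1], [0.5, 0, 0.5], [1, 0.5, 0]]
    print(variance_ratio(prior, posterior))
```

In this toy case the posterior distances are half the prior distances, so the ratio is one quarter; real analyses would need many sampled trees and a proper treespace distance.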
Duke, J. D.; Guo, J.; Forest, F.; Gumbs, R.; McTavish, E. J.; Rosindell, J.
Time-scaled phylogenetic trees summarising evolutionary relationships are fundamental to many analyses in biology, from diversification rate estimation to conservation prioritisation. The most comprehensive available summary of these relationships, the Open Tree of Life, synthesises information from over two thousand studies into a supertree covering the full range of global biodiversity, but its use in downstream analyses is limited by the lack of divergence times. Previous work has mapped dates from Open Tree's database of trees to certain nodes in the supertree, but for the majority of nodes no date is available. While algorithms exist to interpolate missing dates in a tree, we found that their time and memory requirements scaled quadratically with the number of nodes, which made it computationally infeasible to run them on the entire tree. In this work, we describe novel date interpolation algorithms that scale linearly with the number of nodes. These enabled us to produce a distribution of fully-dated trees containing 2.3 million extant described species, greatly expanding the scope of feasible phylogenetic analyses. We illustrate the utility of these trees by computing the most robust estimate yet of the phylogenetic diversity of the complete tree of life, incorporating both topological and temporal uncertainty.
Mishra, S.; Hahn, M. W.
Motivation: Many methods can be used to infer the number and timing of gene duplication and loss events from gene trees. Most such reconciliation methods use a model of gene duplication that does not include the coalescent process, or that restricts it in important ways. As a result, changes to tree topologies due to coalescence will incur a cost of extra duplications and losses using these methods, events that did not actually occur. Results: Here, we present results from the multispecies coalescent with duplication and loss (MSC-DL) model, which allows for the unrestricted interaction between duplication, loss, and coalescence. Theoretical results show that even histories with only a single duplication event can lead to many more trees than are normally considered: for a species tree with 2 tips, 9 trees are possible, while with 6 tips, more than 19 million trees are possible; adding even a single loss almost doubles the number of possible topologies. The probabilities of different topologies and their branch lengths under the MSC-DL for trees with two species are calculated exactly, and we provide an approach for calculating such probabilities on larger trees. These results have important implications for the accuracy of reconciliation methods, ortholog identification methods, and our understanding of evolutionary histories of duplication and loss. Supplementary Information: Supplementary materials are available at https://github.com/smishra677/Distribution-of-Gene-Tree-Topologies-with-Duplication-Loss-and-Coalescence.
Ane, C.; Bastide, P.
Most phylogenetic comparative methods use a species-level phylogeny, ignoring the effect of incomplete lineage sorting (ILS) and hemiplasy on the traits of interest. We consider here a trait controlled additively by one or more unknown loci. Their gene trees may differ from the species phylogeny due to ILS, as modeled by the coalescent process. If the species phylogeny is a network, this process also accounts for gene flow, admixture or hybridization. Our model allows for polymorphism in the ancestral population at the root of the species phylogeny, and predicts heritable within-population variation due to ILS. Even if each locus evolves according to a Brownian motion, the joint distribution of all trait measurements is not generally Gaussian due to ILS. We provide a Gaussian approximation, named the Gaussian Coalescent, and show how to compute its variance matrix efficiently using a single traversal of the species phylogeny. In simulations, this model is much more accurate than the model ignoring ILS. In simulations and on a data set of tomato floral traits, it is favored over the standard Brownian motion model with extra within-population variance. The GC model opens new avenues for various phylogenetic comparative methods, accounting for hemiplasy and gene flow simultaneously. It is implemented in phylolm v2.7.0 and in PhyloTraits v1.2.0.
Serok, N.; Polonsky, K.; Ashkenazy, H.; Mayrose, I.; Thorne, J. L.; Pupko, T.
Multiple sequence alignment (MSA) inference is a central task in molecular evolution and comparative genomics, and the reliability of downstream analyses, including phylogenetic inference, depends critically on alignment quality. Despite this importance, most widely used MSA methods optimize the sum-of-pairs (SP) score, and relatively little attention has been paid to whether this objective function accurately reflects alignment accuracy. Here, we evaluate the performance of the SP score using simulated and empirical benchmark alignments. For each dataset, we compare alternative MSAs derived from the same unaligned sequences and quantify the relationship between their SP scores and their distances from a reference alignment. We show that the alignment with the optimal SP score often does not correspond to the most accurate alignment. To address this limitation, we develop deep-learning-based scoring functions that integrate a collection of MSA features. We first introduce Model 1, a regression model that predicts the distance of a given MSA from the reference alignment. Across simulated and empirical datasets, this learned score correlates more strongly with true alignment accuracy than the SP score. However, Model 1 is less effective at identifying the best alignment among alternatives. We therefore develop Model 2, which takes as input a set of alternative MSAs generated from the same sequences and predicts their relative ranking. Model 2 more accurately identifies the top-ranking MSA than the SP score, Model 1, and several widely used alignment programs. Using simulations, we show that selecting MSAs based on our approach leads to more accurate phylogenetic reconstructions.
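For readers unfamiliar with the sum-of-pairs objective discussed in this abstract, the following is a minimal sketch of how such a score can be computed for a given alignment. The scoring values (match, mismatch, gap) and the function name `sum_of_pairs` are assumptions chosen for illustration; alignment programs typically use substitution matrices and affine gap penalties instead of this flat scheme.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment under a flat scoring scheme.

    `msa` is a list of equal-length aligned strings using '-' for gaps.
    Every pair of rows is scored column by column: two gaps score 0,
    a residue against a gap scores `gap`, identical residues score
    `match`, and different residues score `mismatch`.
    """
    if len({len(row) for row in msa}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    score = 0
    for row_a, row_b in combinations(msa, 2):
        for a, b in zip(row_a, row_b):
            if a == "-" and b == "-":
                continue  # gap-gap columns contribute nothing
            if a == "-" or b == "-":
                score += gap
            elif a == b:
                score += match
            else:
                score += mismatch
    return score


if __name__ == "__main__":
    # Two alternative alignments of the same three toy sequences.
    print(sum_of_pairs(["AC-", "ACG", "A-G"]))
    print(sum_of_pairs(["AC-", "ACG", "AG-"]))
```

Comparing the scores of alternative alignments of the same sequences, as in the usage lines, mirrors the kind of comparison the abstract describes, although only in toy form.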
De Maio, N.
Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories. These methods do not assume prior hypotheses regarding the shape of the phylogenetic tree, and this lack of prior assumptions can be useful, particularly in cases of idiosyncratic sampling patterns. For example, the rate at which species are sequenced can differ widely between lineages, with lineages of more interest to humans usually being sequenced more often than others. However, in some settings sampling can be lineage-agnostic. In genomic epidemiology, for example, the sequencing rate can change through time or across locations, but is often agnostic to the specific pathogen strain being sequenced. In this scenario, one expects that the abundance of a pathogen strain at a specific time and location in the host population is reflected in the relative abundance of that strain among the genomes sequenced at that time and location. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, can greatly improve the accuracy of phylogenetic inference. This is similar to the famous medical principle "when you hear hoofbeats, think of horses, not zebras". In this application, this means that when, for example, observing a (possibly incomplete) genome sequence that has a similar likelihood of belonging to multiple different strains, I aim to prioritize phylogenetic placement onto a common strain (the "horse", a common disease) rather than a rare one (the "zebra", a rare disease). I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree. This approach is based on a new interpretation of multifurcating phylogenetic trees particularly relevant at low divergence: multifurcations represent a lack of signal for resolving the bifurcating topology rather than an instantaneous multifurcating event, and so a multifurcating tree is interpreted as the set of bifurcating trees consistent with the multifurcating one, rather than as a single multifurcating topology. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and using simulations I show that both methods dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented as part of the free and open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
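The count of binary topologies obtainable by resolving multifurcations, mentioned in this abstract, can be sketched as follows. The sketch assumes a rooted tree encoded as nested tuples with string leaves, and uses the standard result that a node with k children can be resolved into (2k-3)!! rooted binary arrangements of those children. The function names are hypothetical, and how MAPLE actually counts or applies this factor is not stated in the abstract.

```python
def rooted_binary_arrangements(k):
    """(2k-3)!!: rooted binary topologies on k labelled subtrees (1 if k <= 2)."""
    result = 1
    for m in range(3, 2 * k - 2, 2):  # 3 * 5 * ... * (2k - 3)
        result *= m
    return result


def count_binary_resolutions(tree):
    """Number of fully binary rooted trees consistent with `tree`.

    `tree` is a leaf name (str) or a tuple of child subtrees. Each
    multifurcating node is resolved independently, so the total is the
    product of the per-node counts.
    """
    if isinstance(tree, str):
        return 1
    total = rooted_binary_arrangements(len(tree))
    for child in tree:
        total *= count_binary_resolutions(child)
    return total


if __name__ == "__main__":
    # Root with four children, one of which is itself a trifurcation.
    tree = (("a", "b", "c"), "d", "e", "f")
    print(count_binary_resolutions(tree))
```

Because the count grows very quickly with the size of a multifurcation, a practical implementation would likely work with its logarithm; that choice is an assumption here, not something stated in the source.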
Mao, Q.; Grünewald, S.
With the availability of full genomes of an ever-increasing number of species, many recent phylogenetic analyses have focused on datasets with thousands of loci. In the presence of incomplete lineage sorting (ILS), many gene trees will be discordant with the species tree, which can then be estimated with a supertree method. In a separate stream of research, the reconstruction of phylogenetic networks, which aim to detect and visualize reticulations in addition to the dominant phylogenetic tree, has become common practice. Level-1 networks have the feature that every node can be interpreted as an ancestor of a subset of the taxa of interest and no two distinct cycles share a common node. TriLoNet is a method that constructs a level-1 network on all taxa from 3-taxon networks, so-called trinets. This approach is similar to modern supertree methods, but the trinets are assigned based on a single sequence alignment. Here we present TriMouNet (Trinet Multilocus Network), which uses the tree topology and branch length distribution in gene trees of a multilocus dataset to infer best-fitting trinets, together with scores quantifying their statistical support. These trinets are then puzzled together into a network on all taxa in a TriLoNet fashion. Experiments on simulated and real datasets show that TriMouNet can identify reticulations with a low false-positive rate if the gene trees are accurate. In contrast, TriLoNet applied to the concatenation of all loci tends to predict wrong reticulations as a consequence of violations of model assumptions.
Sapoval, N.; Nakhleh, L.
Gene tree parsimony (GTP) is a common approach for efficient reconciliation of multiple discordant gene tree phylogenies for the inference of a single species tree. However, despite the popularity of GTP methods due to their low computational costs, prior work has shown that some commonly employed parsimony costs are statistically inconsistent under the multispecies coalescent process. Furthermore, a fine-grained analysis of the inconsistency has indicated potentially complementary behavior of duplication and deep coalescence costs for symmetric and asymmetric species trees. In this work, we prove inconsistency of GTP estimators for all linear combinations of duplication, loss and deep coalescence scores. We also explore empirical implications of this result by evaluating inference results of several GTP cost schemes under varying levels of incomplete lineage sorting.
Tahmid, N.; Rhythm, S. I.; Bayzid, M. S.
Accurate species tree inference from genome-scale data is complicated by gene tree discordance, which can arise both from biological processes such as incomplete lineage sorting (ILS) and from technical factors such as gene tree estimation error (GTEE). While both factors reduce the accuracy of summary methods that infer species trees from gene trees, their relative impact and characteristic patterns remain poorly understood. Here, we systematically disentangle the effects of ILS and GTEE by simulating gene tree datasets with comparable overall discordance levels, but with discordance arising exclusively from either ILS or GTEE. Using widely employed summary methods such as ASTRAL and wQFM, we show that GTEE typically has a stronger detrimental effect on species tree accuracy than ILS, even at matched discordance levels. We further characterize the structure of gene tree distributions under these two sources of discordance and show that ILS induces a structured, constrained skew in quartet distributions, whereas GTEE generates more uniform, high-entropy noise that does not diminish with additional genes. Our results provide an empirical framework for a nuanced understanding of how ILS and GTEE shape gene tree distributions and influence species tree inference, and highlight the importance of appropriately distinguishing biological and estimation-driven discordance when inferring species trees from limited or noisy datasets.
Takazawa, Y.; Takeda, A.; Hayamizu, M.; Gascuel, O.
Phylogenetic analyses often require the summarization of multiple trees, e.g., in Bayesian analyses to obtain the centroid of the posterior distribution of trees, or to determine the consensus of a set of bootstrap trees. The majority-rule consensus tree is the most commonly used. It is easy to compute and minimizes the sum of Robinson-Foulds (RF) distances to the input trees. In mathematical terms, the majority-rule consensus tree is the median of the input trees with respect to the RF distance. However, due to the coarse nature of RF distance, which only considers whether two branches induce exactly the same bipartition of the taxa or not, highly unresolved trees can be produced when the phylogenetic signal is low. To overcome this limitation, we propose using median trees with respect to finer-grained dissimilarity measures between trees. These measures include a quartet distance between tree topologies, and transfer distances, which quantify the similarity between bipartitions, in contrast to the 0/1 view of RF. We describe fast heuristic consensus algorithms for transfer-based tree dissimilarities, capable of efficiently processing trees with thousands of taxa. Through evaluations on simulated datasets in both Bayesian and bootstrapping maximum-likelihood frameworks, our results show that our methods improve consensus tree resolution in scenarios with low to moderate phylogenetic signal, while providing better or comparable dissimilarities to the true phylogeny. Applying our methods to Mammal phylogeny and a large HIV dataset of over nine thousand taxa confirms the improvement with real data. These results demonstrate the usefulness of our new consensus tree methods for analyzing the large datasets that are available today. Our software, PhyloCRISP, is available from https://github.com/yukiregista/PhyloCRISP.
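The two baseline concepts in this abstract, the Robinson-Foulds distance and the majority-rule consensus, can be illustrated with a short sketch. It assumes trees are encoded as nested tuples with string leaves over the same taxon set, and it returns the consensus as a set of bipartitions instead of building a tree. All function names are hypothetical; this is not the PhyloCRISP implementation and does not cover the quartet or transfer distances the authors propose.

```python
from collections import Counter


def leaf_set(tree):
    """All leaf names below `tree` (a str leaf or a tuple of subtrees)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial bipartitions of `tree`, each stored as the side that
    excludes a fixed reference taxon so that equal splits compare equal."""
    taxa = leaf_set(tree)
    reference = min(taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = taxa - below if reference in below else below
        if 2 <= len(side) <= len(taxa) - 2:  # skip trivial splits
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


def majority_rule_splits(trees):
    """Bipartitions found in more than half of the input trees."""
    counts = Counter(s for tree in trees for s in bipartitions(tree))
    return {s for s, c in counts.items() if c > len(trees) / 2}


if __name__ == "__main__":
    t1 = (("a", "b"), "c", ("d", "e"))
    t2 = (("a", "c"), "b", ("d", "e"))
    print(rf_distance(t1, t2))
    print(majority_rule_splits([t1, t2, t1]))
```

Because each bipartition either matches exactly or not at all, a single misplaced taxon removes the split from the count, which is the coarseness the abstract describes as motivation for finer-grained distances.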
Guirguis, J.; Goodyear, L. E. B.; Pincheira-Donoso, D.
Phylogenetic modelling has consolidated as the analytical standard to address hypotheses about the patterns and dynamics of biodiversity in inter-specific contexts. These analyses are traditionally performed implementing phylogenetic linear models where single outcomes are regressed against multiple predictors without explicitly modelling the relationships amongst predictors. A prevailing, yet largely overlooked consequence of neglecting these relationships is what we introduce as Occam's bias, a statistical distortion arising where the model has fewer cause-effect connections than predicted by theory. Here, we propose that Occam's bias is likely to have impacted a wide range of inferences about ecological and evolutionary processes made from phylogenetic linear models across the literature, and thus, that the adoption of approaches to address this bias is critical. We present an empirical test of the long-standing hypothesis that interspecific variation in life-history traits influences the likelihood of extinction risk across 13,949 species of terrestrial vertebrates to show the impacts of Occam's bias in phenomenological inference. Our study calls for a re-evaluation of hypotheses tested using the traditional linear modelling structure and advocates the use and continued development of multi-response model structures that account for all causal pathways in phylogenetic analyses.
Parsons, R.; Liu, Y.; Dua, P.; Markin, A.; Molloy, E.
Motivation: ASTRAL-pro is the leading method for reconstructing species trees under complex evolutionary scenarios involving gene duplication, loss, and coalescence. A major open question is whether ASTRAL-pro is statistically consistent under a unified model of these processes, called DLCoal. This question is challenging to address because ASTRAL-pro seeks a species tree that maximizes the number of four-taxon trees (called quartets) also displayed by the input (multi-copy) gene trees, excluding those induced by duplications and agglomerating those that are homeomorphic up to duplications. Critically, there is no notion of correctness when tagging gene tree vertices as duplication or speciation events in the context of deep coalescence. Results: Here, we propose that a gene tree vertex is correctly tagged as a duplication if it is the most recent common ancestor of at least one pair of gene copies related via a duplication event. Under our definition, deep coalescence propagates duplication tags across gene tree vertices, sometimes resulting in the exclusion of quartets on orthologous gene copies. Nevertheless, we show that A-pro is statistically consistent under the DLCoal model for an exclusion-only version of its objective function, assuming the input gene trees are correctly rooted and tagged. To empirically evaluate this modification, we exclude "duplication quartets" in the related method TREE-QMC and find that it achieves similar accuracy to A-pro on simulated data under varying rates of deep coalescence, duplication and loss, and gene tree estimation error, as well as on a plant data set. Availability and Implementation: TREE-QMC-pro is available on GitHub: https://github.com/molloy-lab/TREE-QMC/tree/tqmc-pro.
Williams, C.; McGillycuddy, M.; Drobniak, S. M.; Bolker, B. M.; Warton, D. I.; Nakagawa, S.
Phylogenetic generalised linear mixed models (PGLMMs) help ecologists to distinguish ecological drivers from other processes shaping evolutionary patterns, yet existing implementations are often limited in distributional scope or computational speed. We compare five R packages for fitting PGLMMs and highlight the new covariance structure propto in the general-purpose GLMM package glmmTMB. Simulations show that glmmTMB fits PGLMMs faster overall than brms, MCMCglmm, INLA, and phyr, while producing similar model estimates. We present the first practical application of glmmTMB for fitting phylogenetic random effects using likelihood-based models that accommodate repeated measures, demonstrated through case studies of evolutionary trait data. By improving both speed and flexibility, glmmTMB broadens access to PGLMM and supports deeper insights into trait evolution and diversification.
Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.
Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or lateral gene transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs, and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm in the level. We implemented our algorithms and demonstrate their applicability on three numerical experiments. Full online version: https://www.biorxiv.org/content/10.1101/2025.11.20.689557
Soares, L. S.; Goncalves, L. T.; Guzman-Rodriguez, S.; Bombarely, A.; Freitas, L. B.
Reduced-representation sequencing approaches such as RAD-seq are widely used in population genomics and phylogenetics, particularly for non-model organisms. However, bioinformatics choices during data processing can strongly influence downstream analyses. One key but underexplored factor is the reference genome used for read alignment and SNP discovery. Here, we evaluate the effects of reference genome choice on RAD-seq analyses using multiple datasets spanning recent radiations in Petunia and Calibrachoa, and reference genomes that differ in phylogenetic relatedness. When using congeneric reference genomes, we observed highly consistent mapping rates, SNP recovery, and downstream population genomic patterns. In contrast, mapping to more distantly related genomes resulted in lower mapping rates and stronger effects on summary statistics. Despite these quantitative reductions, broader patterns of genetic structure and diversity, as well as evolutionary relationships, remained largely congruent across reference genomes. Overall, our results indicate that reference genome choice matters most when genomes are distantly related or when analyses target fine-scale genomic signals. For recent radiations with largely conserved genome structure, closely related reference genomes yield comparable SNP datasets and lead to the same biological conclusions regarding population structure and phylogenetic relationships. These findings provide practical guidance for RAD-seq studies in non-model systems, showing that congeneric reference genomes are sufficient for robust population and phylogenetic inference, and that more distantly related genomes can remain informative when no close reference is available.
Kim, D.; Gil, M.; Katoh, K.; Dessimoz, C.
In phylogenomics, gene tree reconstruction depends on multiple sequence alignment (MSA) and tree inference, and ongoing work continues to improve inference quality. Denser taxon sampling has been associated with improved gene tree inference, suggesting that adding homologs could be a practical route to higher accuracy as sequence databases continue to expand. However, adding sequences can influence multiple steps of typical inference pipelines, and little is known about its specific effects on the multiple sequence alignment, tree reconstruction, and rooting steps. We performed a large-scale empirical benchmark to quantify how homolog enrichment affects alignment and phylogenetic inference. Using an enrichment-impoverishment design and a measure of tree accuracy based on taxonomic congruence, we found that enrichment consistently improves tree inference quality, while effects on alignment quality are marginal. We show that this improvement is associated with accurate root placement on enriched trees when enrichment is accompanied by sensitive homolog search. Notably, much of the benefit can be retained with relatively compact alignments produced by sequence addition. Building on these observations, we provide a tool, AmpliPhy, which efficiently improves phylogenetic reconstruction of protein families through homolog enrichment. The AmpliPhy open-source pipeline software is available at https://github.com/DessimozLab/ampliphy.